Goal

We want to determine which products have a high association, i.e. which products are likely to be bought together.

We will implement this using Market Basket Analysis (MBA), which relies on Association Rule Mining (the Apriori algorithm).

Association Rules

An association rule \(M_1 \rightarrow M_2\) states that itemsets containing \(M_1\) are also likely to contain \(M_2\).
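The rule metrics used throughout this analysis have standard definitions, where \(N\) is the total number of transactions:

\[
\mathrm{support}(M_1 \rightarrow M_2) = \frac{\#\{\text{transactions containing both } M_1 \text{ and } M_2\}}{N}
\]

\[
\mathrm{confidence}(M_1 \rightarrow M_2) = \frac{\mathrm{support}(M_1 \cup M_2)}{\mathrm{support}(M_1)}
\]

\[
\mathrm{lift}(M_1 \rightarrow M_2) = \frac{\mathrm{confidence}(M_1 \rightarrow M_2)}{\mathrm{support}(M_2)}
\]

A lift above 1 means \(M_1\) and \(M_2\) co-occur more often than we would expect if they were bought independently.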

Importing Data

library(readr)

file_path <- "Import files/Market_Basket_Optimisation.csv"

data <- read_csv(file_path, col_names = FALSE)

Each row represents a single customer's transaction.

Data Preprocessing

We convert the data frame to a sparse matrix of transactions.

The sparse matrix is a matrix of 0s and 1s in which each row represents a transaction and each column represents a product; a cell is 1 when that product appears in that transaction.
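As a minimal sketch of this representation, using a hypothetical three-basket example rather than the actual data:

```r
library(arules)

# Three toy baskets: each list element is one customer's transaction
baskets <- list(c("milk", "bread"),
                c("milk", "eggs", "bread"),
                c("eggs"))

trans <- as(baskets, "transactions")  # coerce to a sparse itemMatrix
as(trans, "matrix")  # dense logical view: rows = transactions, columns = items
```

Each basket becomes one row of the matrix, with TRUE (1) in the columns of the items it contains.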

library(arules)

dataset <- read.transactions(file_path, sep = ",", rm.duplicates = TRUE )
## distribution of transactions with duplicates:
## 1 
## 5
summary(dataset)
## transactions as itemMatrix in sparse format with
##  7501 rows (elements/itemsets/transactions) and
##  119 columns (items) and a density of 0.03288973 
## 
## most frequent items:
## mineral water          eggs     spaghetti  french fries     chocolate 
##          1788          1348          1306          1282          1229 
##       (Other) 
##         22405 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1754 1358 1044  816  667  493  391  324  259  139  102   67   40   22   17    4 
##   18   19   20 
##    1    2    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.914   5.000  20.000 
## 
## includes extended item information - examples:
##              labels
## 1           almonds
## 2 antioxydant juice
## 3         asparagus

Creating a frequency plot of the sparse matrix (Top 20 most bought products)

library(RColorBrewer)
itemFrequencyPlot(dataset,
   topN=20,
   col=brewer.pal(8,'Pastel2'),
   main='Relative Item Frequency Plot',
   type="relative",
   ylab="Item Frequency (Relative)") 
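The same frequencies can be inspected numerically with itemFrequency(). Since the plot above depends on the imported file, this sketch uses the Groceries dataset bundled with arules instead:

```r
library(arules)
data(Groceries)  # example transactions shipped with arules

# Relative frequency = share of transactions containing each item
freq <- itemFrequency(Groceries, type = "relative")
head(sort(freq, decreasing = TRUE), 5)  # top 5 most frequent items
```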

Training the Apriori Model on the dataset

The basic steps in implementing the Apriori algorithm are as follows:

  1. Set up a minimum support and confidence

  2. Take all the subsets in transactions having higher support than the minimum support

  3. Take all the subsets in transactions having higher confidence than the minimum confidence

  4. Sort the rules by decreasing lift

The choice of support (how frequently an itemset appears in the dataset) and confidence (how often the rule holds) varies by business case: it depends on the goal, the data size, etc.

For the minimum support, we want products that are bought at least twice a day; since the data covers one week, that corresponds to 2 * 7 = 14 transactions, i.e. a support of 14/nrow(dataset).
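In R, that threshold works out as follows (assuming the 7501 transactions span one week):

```r
n_transactions <- 7501                  # nrow(dataset)
min_support <- 2 * 7 / n_transactions   # bought at least twice a day for 7 days
min_support
## [1] 0.001866418
```

This matches the "Absolute minimum support count: 14" reported by apriori() below.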

Minimum length (minlen) specifies the minimum number of products you’d like to have in a rule (optional).

Maximum length (maxlen) specifies the maximum number of products you’d like to have in a rule (optional).

rules <- apriori(dataset, 
                 parameter = list(support = 14/nrow(dataset),
                                  confidence = .2,
                                  minlen = 2,
                                  maxlen = 20
                                  )
                 )
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime     support minlen
##         0.2    0.1    1 none FALSE            TRUE       5 0.001866418      2
##  maxlen target  ext
##      20  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 14 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [116 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [3193 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Removing Redundant Rules
# A rule is redundant if a more general rule with the same or higher confidence exists
# Equivalent shortcut: rules <- rules[!is.redundant(rules)]
subset.rules <- which(colSums(is.subset(rules, rules)) > 1) # rules that contain another rule as a subset
rules <- rules[-subset.rules] # remove the redundant rules
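A self-contained illustration of the pruning step, using arules' built-in Groceries dataset rather than the imported file (thresholds here are arbitrary, chosen just for the demo):

```r
library(arules)
data(Groceries)

g_rules <- apriori(Groceries,
                   parameter = list(support = 0.01, confidence = 0.3, minlen = 2))
pruned <- g_rules[!is.redundant(g_rules)]  # drop rules implied by more general ones
length(g_rules)
length(pruned)
```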

Rules Arranged by Lift (i.e. strongest associations first)

library(dplyr)

rules_df <- DATAFRAME(rules) %>% arrange(desc(lift))

DT::datatable(rules_df)

Visualizing the results (rules)

Relationship between support, confidence and lift

library(arulesViz)

plot(rules, jitter = 0, engine = "plotly")
  • Rules with high confidence tend to have low support, and vice versa

  • Rules with high lift tend to have relatively low support

Relationship between support, confidence and Number of items in the rule (order)

plot(rules, method = "two-key plot")

  • The order and support have a very strong inverse relationship i.e. as order increases, the support decreases

Network Graph Visualisation

subrules <- head(sort(rules, by = "lift"), n = 30)

#plot(subrules, method = "graph", engine = "htmlwidget")

plot(subrules, method = "graph",
     control = list(
       # edges = ggraph::geom_edge_link(
       #   end_cap = ggraph::circle(4, "mm"),
       #   start_cap = ggraph::circle(4, "mm"),
       #   color = "black",
       #   arrow = arrow(length = unit(2, "mm"), angle = 20, type = "closed"),
       #   alpha = .2
       # ),
       nodes = ggraph::geom_node_point(aes_string(size = "support", color = "lift"))
       #nodetext = ggraph::geom_node_label(aes_string(label = "label"), alpha = .8, repel = TRUE)
     )
) +
  scale_color_gradient(low = "dodgerblue", high = "red") +
  scale_size(range = c(2, 10))

# Shiny App for Interactive Manipulations and Visualization
#ruleExplorer(subrules, sidebarWidth = 2, graphHeight = '600px')